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ABSTRACT 

We de scribe a method performing K;-projection using the fast Gauss transform of 
IStrainI (|l99lf ). We derive the theoretical performance, and simulate the actual perfor- 
mance for a range of w for a canonical array. While our implementation is dominated 
by overheads, we argue that this approach could for the basis of a higher-performing 
algorithms with particular application to the Square Kilometre Array. 
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1 INTRODUCTION 

Interferometers are composed of an array of antennas ar- 
ranged in a 3 dimensional volume. For long observations, 
even instantaneously planar arrays have baselines that are 
non-coplanar due to Earth rotation synthesis. Correcting 
non-coplanar baselines in interferometers require the use 
of a de-projection technique to co mpensate for the n on- 
coplanarity. To date, it;- project ion (jCornwell et al.ll2008l ) is 
the most computationally efficient algorithm known. 

Interferometric images are usually formed by convolu- 
tional resampling of the measured visibilities on a regularly 
sampled, two dimensional uv plane. Convolutional resam- 
pling operates by multiplying each visibility by a convolution 
function and adding the result to the uv-plane. In the sim- 
plest case, the convolution function is used for anti-aliasing 
purposes, although it can be used to compensate for primary 
beam affects, or in the case of it;- project ion, to compensate 
for non-coplanar baselines. K;-projection uses a convolution 
function that is dependant on w, which is the convolution 
of an anti-aliasing function, and a Fresnel term. 

While m ore ^efficient that other methods 

(|Cornwell et al.1 l2008 l). It;- project ion does have some 
shortcomings. Firstly, the convolution functions take time 
to compute, and for a short observation, can dominate 
the time to compute an image. Secondly, the amount of 
memory to store the convolution functions can be large, and 
can exhaust the capabilities of a computing node. Finally, 
multiply-add operation is generally memory-bandwidth 
bound, so that the central processing unit (CPU) spends 
the majority of the time waiting for the data to arrive, and 
relatively little time doing the multiply-add operation. 
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The memory constraints of it;- project ion are of partic- 
ular concern. The trend in computing has been for memory 
bandwidth to increase much more slowly than arithmetic ca- 
pacity, which has roughly doubled every 18 months (Moore's 
law), with this trend likely to continue for the foreseeable 
future. Therefore, the performance of memory-bandwidth 
bound algorithms such as it;- project ion will not improve as 
quickly as an algorithm which is bound by arithmetic capac- 
ity. Furthermore, modern high performance computers rely 
on having a very large number of small nodes, with each 
node having relatively limited memory. The requirement to 
store large convolution functions is, therefore, at odds with 
having nodes with small amounts of memory. 

Memory efficient algorithms, therefore, will be an im- 
portant step on the path to implementing wide-field imag- 
ing capabilities for the Square Kilometre Array (SKA). We 
have embarked on an program to research new approaches 
to interferometric imaging, with particular emphasis on al- 
gorithms that align more closely with the memory-bound 
computing that will be available in the time frame of the 
SKA. 

In this paper we describe a different approach to it;- 
projection. The key to our method is to represent the con- 
volution function, including the it;- project ion term, and an 
anti-aliasing function, as a complex Gaussian. The grid- 
ding and degridding problems can then be s olved using t he 
variable-width fast Gauss transform fFGT: IStrainI (Il99ll)). 
One advantage of this approach is that it does not require 
convolution functions to be computed and stored, and the 
theoretical memory bandwidth is less than standard grid- 
ding, in some situations. 

In section[2]we describe Gaussian anti-aliasing functions 
used for it; projection and in section [3] we apply these func- 
tions with the Fast Gauss Transform, to the gridding and 
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degridding problems. In section Owe derive the theoretical 
operations and memory requirements, and in section [5] we 
compare the theoretical requirements for our method and 
the standard methods. In section \6\ we compare the theo- 
retical results with a real- world implementation. We discuss 
our results in [7] and we draw our conclusions in section [51 



2 M^-PROJECTION WITH A GAUSSIAN 
ANTI-ALIASING FUNCTION 

In this section we describe the gridding process, and use 
a Gaussian anti-aliasing function to obtain a closed form 
expression for the convolution function. 

We acknowledge that the anti-aliasing properties of 
a Gaussian are not ideal. Typical interferometric imaging 
uses prolate spheroidal wa vef unctions , which have a optimal 
out-of-band rejection (Schwab 1980). Nonetheless, Gaus- 
sian functions have simple analytical relationships (e.g. the 
convolutions and Fourier transforms of Gaussians are both 
Gaussians) that allow us to proceed. 



2.1 Gridding 

The aim of convolutional gridding is to take visibilities sam- 
pled on arbitrary u,v,w coordinates and interpolate them 
onto a regular grid so that a 2D fast Fourier transform 
(FF T) can be used to e stimate the sky brightness distribu- 
tion (Taylor et al.""l999', p. 134fr). Mathematically, the grid 
is evaluated at a point {uc, Vc) by multiplying each visibility 
by a shifted convolution function according to: 

M 

F(Uc,Vc) = ^C(l^c - Uk,Vc - Vk,Wk)V{Uk,Vk,Wk) (1) 

where M is the number of visibilities in to be gridded, 
and w are the coordinates of the projected baseline (in wave- 
lengths), C{u, V, w) is a convolution function and V{u, v, w) 
is the measured visibility weight. Typically, C is assumed 
to be zero outside some region of support (6-8 pixels). If 
we take no account for the w term, then the convolution 
function is chosen to be an anti-aliasing function, and is 
independent of w, i.e. C{u,v,w) — A{u,v). If we a ccoun t 
for the w term using project ion of ICornwell et all (|20 
then the convolution function is given by 



C{u^ V, w) = A{u, v) * G{u, V, w) 



(2) 



where * denotes convolution, A(u^ v) is an anti-aliasing func- 
tion, and 



G{u^ v^w) — — exp i^—nilu^ + v^]/vi?j 



(3) 



is the Fresnel diffraction term required by the w projection 
algorithm. 



2.2 Radially symmetric Gaussian anti-aliasing 
functions 

For a fixed equation [2] is the convolution of the anti- 
aliasing function with a complex Gaussian. If we choose a 
Gaussian for the anti-aliasing function A, the resulting con- 
volution function is also Gaussian. To simplify notation, we 



will consider the radially symmetric problem by introducing 
a new parameter — -\-v^ , and can now write the Fresnel 
term as: 



G{t,w) 



— exp 

w \ w 



— exp 

w 



(4) 
(5) 



with Sg = —w/tt. We can also write the Gaussian anti- 
aliasing function as 



-t 

A{t) = exp ( — 



(6) 



The gridding function is given by the convolution G{t) * 
A{t), which is another Gaussian, whose variance is equal to 
the sum of the variances of the original Gaussians, i.e., to 
within a scaling factor: 



C{t,w) = A{t)^G{t,w) 

= — exp -t — 

w \ oa-\- iog J 

,2 



(7) 
(8) 
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real envelope complex chirp 
The width of the envelope is given by 



5a ' 

and width of the complex chirp is 
5\ + 5l 



5i 



5g 



(10) 



(11) 



(12) 



(13) 



The convolution function is, therefore, a complex Gaus- 
sian, which can be expressed as the product of real- valued 
Gaussian envelope (the 5r term in equation [T0|) with a com- 
plex chirp (the 5i term). Clearly, as \w\ increases, the width 
of the Gaussian envelope 5r increases. 

For a given visibility indexed by j the form of the con- 
volution function can be calculated from its w coordinate 



usmg: 

C(t, Wj) — — exp ( — ^ 
where 

Sa — Sa — 1 — 

7T 

is the width of the relevant convolution function, and Wj is 



(14) 



(15) 



the w coordinate of the visibility. 



2.3 Conversion from wavelengths to pixels 

Up to this point, the u,v,w co-ordinates and the Gaussian 
widths have been in units of wavelength. For the remainder 
of this paper, width will often be quoted in terms of pixels. 
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We use the terms pixel and uv-ce\\ interchangeably. To con- 
vert from wavelengths to pixels, a size of uv-ceW is required. 
The size of the uv-ce\\, (in wavelengths) is set by the inverse 
of the desired field of view (in radians) i.e. = 1/^f.o.v 

Therefore, a Gaussian width in pixels can be calculated 
from the width in wavelength by: 



^(pixels) 



6 (wavelengths) 



(16) 



3 W PROJECTION WITH THE FAST GAUSS 
TRANSFORM 

3.1 A brief introduction to the Fast Gauss 
Transform 

In this section we introduc e the Fast Gauss Transform 
(FGT) with variable scales (jStrainl [ToQll ) (hereafter S91), 
which is an approximate technique for calculating the sum 
of Gaussians. We conform to the language of S91, which 
uses the term 'source' to describe a Gaussian function with 
a given position, amplitude and width. We will describe the 
application of the FGT onto the gridding problem in section 
[321 

The FGT works by partitioning the evaluation region 
into square boxes, with each box containing a set of Taylor 
coefficients (Fig. [1]). For each Gaussian 'source', the Taylor 
coefficients for the boxes surrounding the source position are 
updated, with the size of the updated region being defined 
by the 'width' of the source, i.e. wider Gaussians update 
more boxes. When all sources have been applied, the Taylor 
coefficients are evaluated at any number of target locations. 
The algorithm is parameterised by three numbers: r, which 
sets the box size, p, which defines number of Taylor coef- 
ficients to consider, and 6 which sets the smallest size of 
Gaussian that will be considered. Reducing the box size, or 
increasing the number of Taylor coefficients increases the 
accuracy of the sum. 



3.2 Gridding with the Fast Gauss Transform 

Now that we have a Gaussian convolution function (Equa- 
tion I14p , we can solve the sum of Equation [T] the FGT with 
source- dependant scales. In the language of gridding visibil- 
ities, the evaluation region is the uv plane. The 'sources' are 
the visibilities, with the source position being given by the 
coordinates of the visibility on the uv plane. The width of 
the source is related to the w coordinate of the visibility as 
in Equation [151 with the minimum width when it; = 0, being 
essentially the width of the anti-aliasing function. When all 
the visibilities have been applied to the Taylor coefficients, 
the Taylor coefficients are evaluated on a regular grid, for 
each uv cell on the uv plane. 

For a detailed description of the algorithm, we refer the 
reader to S91. For the purposes of explaining our implemen- 
tation, we outline the algorithm in the following text, us- 
ing similar notation to S91. This description applies to the 
one-dimensional case, but S91 describes the relationships to 
extend it to the multi-dimensional case (using multi-index 
notation), for which the 2D case is relevant here. Addition- 
ally, this description is only valid for real Gaussians. The 




Figure 1. This figure illustrates the process of gridding or degrid- 
ding a single visibility. The uv plane of size ATpi^ is partitioned 
into boxes size L. The box size can be the same size as a uv cell. 
Each box contains (p — + l)"^ Taylor coefficients. To grid or de- 
grid a visibility indexed by j, all boxes within Rj of the visibility 
uv position (tj) are updated (circular blue shaded region). Within 
each box, Taylor coefficients are updated with no cheating. If 



cheating is enabled (i.e. Pd > then only q'^ of the (p — pd- 
Taylor coefficients (orange shaded region) are updated. 
been exaggerated for clarity. 



-1)2 

has 



extension to complex Gaussians will be considered in the 
next section. 

The algorithm proceeds as follows: 



(i) Given values of r and p calculate: 



1 - 
where 



p+ 1 



< 1 



(17) 



(18) 



(ii) Partition the uv plane into the set of boxes B with 
centers ts, and side length L = rV26 parallel to the axes, 
where 6 — min^ 5j — 5a is the smallest Gaussian width to 
be considered. 

(iii) For each visibility indexed by j, calculate the range 
Rj — \/—Sj log e. Also calculate the order q ^ p satisfying 

^ e (19) 



where rqj = r ^J'eJJ6j{q~^^l) . 

(iv) For each box within Rj of the visibility position tj, 
update the Taylor coefficients according to: 



4 K. W. Bannister et al. 



Cb 



1/31/2 



V~6 



(20) 



where (3 runs from to ^, Vj is the visibihty weight, and hp 
is the Hermite function (see S91). 

(v) Once aU the visibihties have been apphed, evaluate 
the Taylor coefficients at each cell in the uv plane, by finding 
the box in which the uv cell is situated, and calculating the 



tc — ts 
V6 



(21) 



where tc is the position of the uv cell, and /3 runs from to 
P- 

The error in the evaluation of the sum is bounded by: 



where 

M 

r and p are chosen to make l^d ^ Ve. 



(22) 



(23) 



3.3 Extension to complex Gaussians 

The formulation of S91 is explicitly for real Gaussians, i.e. 
with Sj, Vj G 5^, but in our case Sj,Vj G C. A full derivation 
of the algorithm for complex values is outside the scope of 
this work . The important questions are whether the series 
of Equation [20l converges (and for what values of a complex 
6) and what is a suitable range (Rj). 

Empirically, we have determined that: 



(i) Using a range of Rj = ^y^6RJ\oge provides good 
results. This also makes sense because it is the amplitude of 
the envelope that determines the range. 

(ii) For calculating the order q (Equation [19} we used 



max{6A, Si) 



(24) 



Other possibilities (e.g. Sj = 6r) reduced q too aggres- 
sively as w increased, which resulted in very large errors. 

In our case, the minimum value of 6 occurs when w = 
which also coincides with ^(^) = so the box size is easily 
set as equal to L = rV2^A. 



3.4 Optimisations 

S91 describes a number of optimisations that can be used in 
the FGT with source-dependent scales: 

(i) S91 introduces a single set of Taylor coefficients, po- 
sitioned at the center of the uv plane, to capture Gaussians 
with the largest scales. Visibilities larger than an upper scale 
size Sj > b can be added to this single set, rather than the 
sets associated with the boxes. S91 recommends the upper 
scale size be at least 10% of the size of the evaluation re- 
gion, in order to allow for practical precision in evaluating 
the large number of Taylor coefficients. In practice, most ar- 
ray configurations would have relatively few baselines with 



such a large w that their convolution functions would span 
> 10% of the uv plane, so this optimisation has limited im- 
pact. 

(ii) S91 introduces a lower scale a below which the Gaus- 
sian sum is directly evaluated. If there are only a few Gaus- 
sians with scales smaller than a, we can increase the box size, 
which reduces the memory and arithmetic capacity required, 
while incurring minimal penalty from direct evaluation. In 
practice, any interferometric observation is likely to be ar- 
ranged such that the mean value of w is approximately zero. 
Therefore, there are unlikely to be outlying, small Gaussians 
that would allow for increasing the box size. Once again, this 
optimisation has limited impact. 

(iii) S91 proposes doing direct evaluation for boxes con- 
taining only a few evaluation points. As we are evaluating 
on a regular grid, we have the same number of evaluation 
points inside each box, so applying this optimisation would 
degenerate into standard gridding. 

(iv) S91 proposes a method for avoiding needlessly accu- 
rate computation, i.e. for large scales (which are smoother 
than small ones) more boxes are updated, but smaller num- 
ber of Taylor coefficients (i.e. the order q described in Section 
13. 2p per box can be updated. This optimisation is useful and 
derives the most benefit for visibilities with large scales (\w\ 
large, see sectionlH). We have incorporated this optimisation 
in Equation [T9l 

(v) S91 notes that the equation to calculate the order 
(Equation [19)) overestimates the error by at sometimes two 
orders of magnitude. Memory bandwidth and operations can 
be saved by 'cheating' on the value of q by decrementing q by 
some value. This strategy is described in the section 13.4.11 

(vi) The fact that each uv-cell contains a number of Tay- 
lor terms affords an additional axis available for parallelisa- 
tion. The strategy is described in section [3.4.21 



3.4-1 Cheating 

There are two sources of error when a sample is gridded (see 
Figure [3|. The first is from truncating the support; that is, 
only boxes only within a finite range Rj = \/—Sr log e are 
updated. Convolutional gridding suffers from the same type 
of truncation error. The second source of error is from trun- 
cation of the Taylor terms; that is, only q < oo Taylor terms 
are updated. S91 notes that the equation for calculating the 
error from truncation of Taylor terms (Equation I19p over- 
estimates the error by several orders of magnitude. If this 
is the case, for a given value of e, the error will be domi- 
nated by the truncated support, and the contribution from 
the truncation of the Taylor terms will be negligible. In prin- 
ciple, therefore, one can reduce the number of Taylor terms 
without substantially increasing the overall error, as long as 
it remains less than the error from the truncation of support. 

In order to balance the errors from both effects, we in- 
troduce an additional parameter to the algorithm pd, which 
decrements value of q, so that the actual value of q that is 
used is q — max(q — pa, 0). This reduces the CPU and mem- 
ory requirements of updating a given box, at the expense of 
increasing the errors due to truncating the Taylor terms. 



5 



3.4-2 Parallelisation 

One interesting property of our approach is that affords an 
additional axis of parallehsation (the Taylor terms) in addi- 
tion to those available for standard gridding (typically, data 
parallelisation over frequency channels). We assume that the 
parallelism is achieved through message passing. 

In section |4] we show that the memory bandwidth and 
operations count for the FGT is dominated by updating the 
Taylor terms. As each of the Taylor terms is independent, 
one can in principle store each of the (p — + 1)^ Taylor 
terms on (up to) as many nodes, thereby dividing the per- 
node storage requirement by the number of nodes. For the 
case where q'^ is a multiple of the number of nodes, the work 
is spread equally among the nodes. Each node can update 
its Taylor terms for the boxes within range, and, the total 
memory bandwidth is increased by the number of nodes. 

One penalty of parallelising in this way is that each node 
must calculate the same interim values (e.g. the values of the 
Hermite polynomials). This duplication can be reduced by 
partitioning the Taylor terms among fewer nodes, with each 
node being responsible for a particular row or column of the 
Taylor terms. 

If q'^ is less than the number of nodes, then some nodes 
will have no work to perform, as they are responsible for 
Taylor coefficients that are not being updated. One must be 
careful, therefore, to choose an algorithm parameterisation 
i.e. (r,p and pa) that guarantees never to give q'^ less than 
the number of nodes. 

The efficiency of this approach depends crucially on the 
properties of the computing nodes. It is useful if the nodes 
are storage or memory bound, but not if the cost of com- 
puting the interim values dominates the computing time. 
As this tradeoff requires intimate knowledge of the partic- 
ular computing architecture, we will not pursue a detailed 
analysis of parallelisation in this paper. 

3.5 Degridding with the Fast Gauss Transform 

Degridding is used as part of the m ajor cycle of Cotton- 
schwab clean (Rau Cornwell 1201 ll ) . Degridding involves 
taking the dot-product of the uv plane with the convolution 
function shifted to the location of a visibility at an arbitrary 
location (i.e. not at the center of a uv cell). It is the dual of 
gridding. 

Degridding can be performed analogously to gridding 
using the fast Gauss transform with tar^get-dependent scales 
(S91). All the same comments about optimisations for grid- 
ding (Section l3.4p apply as for degridding. 

3.6 Key differences between classical gridding and 
the FGT 

The key conceptual differences between classical gridding 
and FGT gridding are described in Table [T] 



4 THEORETICAL PERFORMANCE 

Here we derive the theoretical performance of the gridding 
and degridding processes using the fast Gauss transform, 
and traditional convolutional gridding. 



4.1 Operations count of the fast Gauss transform 

4.1.1 Gridding 

To grid a single visibility of width 5j , the number of boxes 
that are within a circular region within the range Rj — 
\/—5r log e is approximately: 

NB = \(2\R,m+lf. (25) 

The the order ^ ^ p is chosen to satisfy the error esti- 
mate: 



where rqj = e6/6j{q + 1). 

To apply cheating (Section I3.4.1|) . we decrement the 
value of q, such that q' = max(q — Pd, 0). 

Now we consider work required to update the Taylor 
coefficients in each box fEa. [20)) . The key is to compute and 
store the Hermite functions, and powers of (^/^j)'^'/^ sepa- 
rately, and perform the sum at the end, by taking advantage 
of the product form of the multi-index notation for multiple 
dimensions. 

The Hermite function is a polynomial multiplied by a 
complex exponential. The argument for the complex expo- 
nential is the same for every order q , so only one calculation 
is required per set of Hermite function evaluations. The way 
in which a complex exponential is computed by a CPU is 
highly implementation-dependant, so we simplify the anal- 
ysis here and assume that it requires A^cexp floating point 
operations. 

We observe that the Hermite function of order q has 
only Nc = [q' /2\ -\- 1 nonzero coefficients. Second, we 
note that we will evaluate the Hermite functions for or- 
der ^ 13 ^ q with the same argument. Therefore, we 
can calculate the powers of the argument initially with q' 
operations. Evaluating each Hermite function requires only 
Nc additional operations. Therefore, to compute all Hermite 
functions up to order q' requires approximately: 

, hermite — 

/3 = 
Nc 

= A^cexp+^' + 2^[/3/2j +1 

/3 = 

- Nce.p+q' + Nc{Nc + l) (27) 

The multi-index powers of t/3 = (^/^j)'^'/^ can be 
computed recursively, by computing tn = {S/Sj)tn-2 with 
to = (V^j)^/^ and ti = {S/6jy. This requires 2{q' + 1) 
operations ^ n ^ 2q' . 

The weight Vj can be folded into the power values also, 
with 2{q + 1) operations. 

Finally, the full sum to update over two dimensions 
requires the multiplication of Hermite functions for 2 dimen- 
sions, and the powers, plus the accumulation, which requires 
3{q + 1)^ operations in total. 

The total number of operations to update a single box is 
therefore the sum of two Hermite evaluations (one for each 
dimension), plus the powers of S/Sj, the weights and the 
final sum, i.e.: 

iVgrid = 2{Nce.p + q' + Nc[Nc + l])+^q +l) + S{q +lf{28) 
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Table 1. Conceptual differences between classical gridding and gridding with the Fast Gauss Transform. 



Property 



Classical Gridding 



Fast Gauss Transform 



Accuracy: 

Anti-aliasing function: 
Fundamental scale: 



Exact 
Arbitrary. 

A uv-ce\\, set by imaging geometry. 



Gridding a single visibility updates: Cells on the w-u-plane. 



Approximate, but with controllable error. 
Gaussian only. 

A box, which can be smaller, or larger than a 
uv-ce\\. 

Taylor coefficients in a set of boxes. The cells of 
the uv-plane is calculated as a post-processing 
step. 



For small q , the cost is dominated by the cost of evalu- 
ating the complex exponential. For large q\ it is dominated 
by the multiply-add step in the final sum. 



4.1.2 

The operations count for the degridding process is similar 
to the gridding case. The order q' is evaluated identically, as 
are the Hermite functions and the powers of ^/ Sj . The only 
difference is that no weight is included, and the multiply-add 
stage includes the product of the two hermite functions, the 
power and the Taylor term, requiring 4(q' + 1)^ operations. 
The number of operations per visibility is therefore: 

iVdegrid = 2(iV,exp+^'+iVc[iVc + l])+2(^' + l)+4(^^ + l)^(29) 

Once again, for small q , the cost is dominated by the 
cost of evaluating the complex exponential. For large q , it 
is dominated by the multiply-add step in the final sum. 



4.1.3 Memory bandwidth 

Calculating memory bandwidth is complicated by the is- 
sue of cache hierarchy. We assume the simplest case: i.e. no 
caching. In practice, this is a reasonable assumption, as visi- 
bilities are often stored in no particular order. Therefore, the 
desired grid (and convolution functions in the case of con- 
volutional gridding) are essentially random, meaning that 
all memory accesses go to the main memory and bypass the 
caches. 

For the fast Gauss transform, gridding a visibility re- 
quires the updating of {q -\-1)'^Nb complex coefficients. The 
coefficients must be read into the processor, updated, and 
written back to memory, requiring 2{q' + I^Nb memory 
transactions per visibility. 

For degridding, the coefficients do not need to be writ- 
ten back to memory, therefore the only (q + 1)^Nb memory 
transactions are required. 



4.1.4 Storage 

For an image size of A^pix, the gridding operation requires 
storage for {N^\^/ (p — pd -\- 1)^ complex coefficients. De- 
gridding requires the same storage. 



4.2 Performance of convolutional gridding and 
degridding 

In this section we derive the operations count, memory 
bandwidth and storage required to perform gridding and 
degridding with stored convolution funcitons. In order to 
put the two methods on equal footing in terms of imaging 
performance, we will use Gaussian anti-aliasing functions 
for the It;- project ion, rather than the commonly-used pro- 
late spheroidal wavefunctions. This means that the imaging 
performance of the two approaches is essentially the same 
(except for the errors in truncating the Taylor series of the 
FGT) and the support size of the convolution functions is 
also easily computed. 

We assume standard w projection computes the con- 
volution functions in advance and stores them in memory. 
We assume a convolution function of 1-D size for the kth 
w-plane Mk = kAw, where k is an integer, and Aw differ- 
ence in size between different w planes. Usually the cached 
convolution function is oversampled by a factor k of 4-8 in 
order to accurately grid visibilities whose coordinates are 
not exactly in the center of a uv cell. 

We assume that we truncate the convolution function 
when the real envelope reaches a value e, which is at a dis- 
tance te = ^/—Sr log e from the centre of the convolution 
function. Therefore, the ID support size in pixels, of the 
A;th w-plane is (by substituting Equation I12p 



Mk = 2te 



= 2WlogeUA + 



JfAw^ 



(30) 



4.2.1 Operations count 

Gridding a visibility requires two operations per point 
(weight times convolution function, add to grid), for a total 
of 2Mk operations per visibility. Degridding also requires 
two operations per point (convolution function times grid, 
add to result) and and therefore requires 2M^ operations 
per visibility. 
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4-2.2 Memory bandwidth 

Gridding requires the convolution function, and the grid po- 
sition to be retrieved and written back to memory, requiring 
memory transactions per visibihty. Degridding has no 
requirement to write back the result, (as the intermediate 



L=l p=6 pa=0 L=2p = 15prf=0 

L=l p=Q =3 L—2p — 15p^ —3 



SO only 2M^ memory transac- 



sum is stored in a register) 
tions are required. 

4.2.3 Storage 

To compute the total memory required, we will deter- 
mine the amount required to store Nw it;-planes for val- 
ues of w between [0, -\-Wmax], uniformly sampled with width 
Aw = Wmax/Nw In practice, we need to store iL'-planes for 
0), so we will take the above result and multiply by 

two. 

The amount of memory required to store the convolu- 
tion functions is therefore equal to the sum of the oversam- 
pled convolution functions, i.e.: 



iVrr 



-4^ log e (^N^6a + ^ ^ [N^ + 1] [2N^ + |$^) 
-U'^oge^N^ (32) 



5 THEORETICAL PERFORMANCE 
COMPARISON FOR GRIDDING 

We consider the gridding problem for two scenarios, L = 1 
and L — 2. We assuming = 1 pixel, a required accuracy 
of e = 10~^, a 1 degree field of view and a 6 km maximum 
baseline at A = 0.2 giving a it^max of 30/cA. The convolu- 
tion function size at maximum w of Mwmax — 191 pixels. 
We assume the cost of evaluating a complex exponential is 
Ncexp — 100 flops. This estimate is also justified by mea- 
surements of our implementation (Fig. [5]). 

The FGT method for L — 1 and L = 2 is outperformed 
by standard gridding both in terms of operations and mem- 
ory bandwidth, if cheating is not enabled (Figure [2]). For the 
L — 1 case in particular, the operations count is dominated 
by the complex exponential. 

If cheating is enabled, then the situation is substan- 
tially improved. Once the w value reaches above a certain 
threshold, the order = 0, and only one Taylor coefficient is 
updated per box. In the L = 1 case, the memory bandwidth 
is reduced to 0.5 of the standard gridding case, because the 
FGT only requires only two memory transactions per pixel 
(read + write coefficient), while standard gridding requires 
three (read pixel + read convolution function + write pixel) . 
For L — 2 the improvement in memory bandwidth is even 
more with the FGT requiring 10 times less memory band- 
width than standard gridding. The FGT is more efficient as 
it only updates one coefficient for each box, which encap- 
sulates 4 pixels. The FGT case is also improved because it 
only updates boxes within a circular region, while standard 
gridding updates all pixels within a square. 

This encouraging result leads to a number of questions: 




Figure 2. Gridding with the fast Gauss transform can save mem- 
ory bandwidth for large support sizes, at a cost of floating point 
operations. Plotted here is the ratio of operations (top panel) 
and memory bandwidth (bottom panel) for the fast Gauss trans- 
form compared with convolutional gridding (see Section [J]). The 
X-axis is the value of the w coordinate, which is also a measure 
of the width of the convolution function (Equation I15|) . The tar- 
get error for each algorithm was set to e = 10~^. The fast Gauss 
transform was configured with 5a = ^ pixel and we have assumed 
Ncexp = 100. The step drops in the ratio occur when the FGT 
reduces the order q' for the Taylor truncation. For w > 14 kA, the 
FGT with L = 2 requires 21.8 times more operations but only 
0.1 times the memory bandwidth of classical gridding, but only 
if cheating is enabled with pd = 3. 



is the FGT bound by memory bandwidth? Can these per- 
formance gains be realised in practice? Can cheating with 
Pd = 3 give reasonable image errors? We will address these 
questions in the following section. 



6 IMPLEMENTATION 

We implemented the gridding and degridding algorithms as 
described in Section [3] in C++. As with the theoretical sim- 
ulation, we set the width of the anti-aliasing function equal 
to = 1 uv cell. We aim explore the parameter space sim- 
ilar to the theoretical analysis, i.e. around L = 1 and L — 2 
and ranges of errors around 10~^ < e < 10~^. 

To compare the gridding errors, we compared the rel- 
ative error of a visibility gridded with the FGT with the 
equivalent complex Gaussian (Equation [9]), calculated over 
256^ pixels (which is larger than the 191 pixels we expect for 



10""^ and w 



To compare the compute times, 



we compare the time to grid 100 visibilities of fixed u,v,w 
coordinates with the FGT, with the same number of visibili- 
ties gridded with standard gridding. The support size of the 
gridding kernel was chosen to have with equivalent error to 
the FGT error, i.e. pixels. We chose 101 w planes from 
to Wmax- The processing was performed on an Intel Core 2 
Duo processor running at 2.66 GHz, with 32 kB of LI cache, 
3 MB of L2 cache and with 8 GB of 1057 MHz DDR3 RAM. 
We consider here only the results for the gridding op- 
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eration, as the degridding algorithm results essentially the 
same. 

6.1 Error patterns 

A typical error pattern is shown in Figure [S] The absolute 
error in the gridded visibility contains two components. The 
circular component is due to the truncation to finite support, 
i.e. boxes outside the circle were not updated. The vertical 
and horizontal lines are due to truncation to a finite number 
of Taylor terms, with the largest error where the grid is 
evaluated near the boundary of adjacent boxes. The box 
structure of the gridding process can also be clearly seen in 
the FFT error, which also shows a box structure. 

6.2 Peak Image Errors vs w 

The peak error in the FFT of the visibility (i.e. the image 
plane), as a function of the w for a range of algorithm pa- 
rameterisations are shown in Figure [H To begin with, we 
consider the case for L ^1. Somewhat surprisingly. Equa- 
tion [iTl which describes the error in gridding a single visi- 
bility on uv plane, also predicts the error the image plane 
for many parameterisations and values of w. For p^z = 0, 
and small w, the errors are substantially smaller than the 
predicted e. As w increases, q' is reduced, and the error gen- 
erally matches the predicted e. For some parameterisations 
(e.g. r = 0.5, p — 7) there is a jump in error above predicted 
e at a particular w. 

For the L 2 case, for small the errors are at or be- 
low the theoretical limits in most cases, however in many 
cases the errors are considerably worse than for L — 1. For 
example, for w small in some cases (e.g. L — 2.5, p — 18, 
Pd = 0) the error increases far beyond the theoretical limit. 
For large the errors are several orders of magnitude worse 
than predicted. 

6.3 Processing time 

The measured processing time for the FGT vs. standard 
gridding is shown in Figure O Our implementation operates 
substantially slower than standard gridding for all parame- 
terisations and ranges of and the operations rate is well 
matched by the our theoretical model for the operations 
count. 



7 DISCUSSION 

7.1 FGT Implementation 

The results from testing our implementation are somewhat 
conflicting. On the plus side. Figured clearly shows that, for 
small values of Equation [19] is overestimating the trunca- 
tion error, and the value of q being used is too large. This 
was the original motivation for introducing cheating, and it 
is clear that the error is still overestimated even for p^z = 3 
in some cases. 

Unfortunately, for larger values of the fact that the 
error jumps above the predicted e, even with no cheating, 
implies that our empirical extension to the FGT to complex 
Gaussians, (in particular Equation [191 and Equation [2^ is 



not correct. As a result of this incorrectness, our current 
cheating algorithm has limited value. Even with p^z = 1, 
most parameterisations contain jumps in error of factors of 
few to 100 over the range oiw. A different mapping between 
w and q is required to maintain a constant error across w. 

Increasing the box size above L > 2.3 appears to have 
catastrophic affects on the error for small w. We suspect this 
is due either to numerical instability in evaluating high-order 
polynomials, or more likely, the Hermite expansions simply 
fail to converge when the Gaussian width is less than some 
function of the box size. 

Our implementation's processing time is clearly 
compute-bound as our computing times match our theoreti- 
cal model for the operations count almost exactly in most in- 
stances. The theoretical model suggests the operations count 
is dominated by the cost of computing the complex exponen- 
tial, and we expect that is the case, although other overheads 
may also be responsible. Sadly, we were not able to approach 
the memory-bound performance that originally motivated 
this work. 

7.2 Future work 

In spite of the disappointing performance of our implemen- 
tation, we hope that some of the ideas from this work could 
be extended in future. In particular: 

• The idea of having tuneable error is attractive. For ar- 
rays with high sidelobes in their synthesised beams (i.e. poor 
uv coverage), large gridding errors can be made, as long as 
they remain below the clean threshold for the given major 
cycle. 

• The required memory bandwidth of our algorithm is 
only weakly dependant on \w\^ so for arrays with very long 
baselines, our algorithm may be suitable. 

• The Hermite expansio ns were originally proposed by 
iGreengard StrainI (|l99lh for real Gaussians. It may be 
that faster- converging expansions can be found for complex 
Gaussians - particularly in the case for large w where the real 
envelope is smooth and the complex chirp contains many 
peaks. 

• The form of the convolution function is smooth at the 
centre and oscillates more rapidly as t increases, with the 
fastest oscillations being damped at the edges by the enve- 
lope. Currently, our approach is to use the same order for 
all boxes, which leads to the largest errors being made away 
from the centre of the convolution function (Figure [3|) where 
the oscillations are the fastest and largest. There may, there- 
fore, be some value in varying the order q' as a function of 
the distance from the centre, thereby applying the majority 
of the computational effort where the errors are likely to be 
worst. 

• We implemented the polynomial evalu ations in a na ive 
manner. More sophisticated methods exist (|Knuthlll997l . p. 
485ff) that could reduce the operations count and improve 
numerical stability. 

• To calculate the complex exponential, we used the 
CEXP function from the standard C library, which is accu- 
rate but slow. Faster, approximate methods are availabl^B- 

^ |http : //gruntthepeon . free . f r/ssemath/ 1 (e.g. (ISchraudolphl 
119981) 
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Figure 3. Error pattern for a visibility gridded at lu = lUmax for L = 0.7 and = 7. Each box is 256x256 uv pixels. The rows are the 
visibility computed with the full complex Gaussian, the FGT, the error, and the FFT of the error. The columns are the real, imaginary 
and absolute values of the data respectively. The circular component of the error is due to the truncated range, while the horizontal 
and vertical components of the error are due to the truncation of the Taylor series. This example was chosen because it illustrates both 
components clearly. 



For an implementation whose run time is dominated by 
CEXP such as ours, these methods could improve speed 
by a factor of a few, albeit at the cost of some accuracy. 

• Gaussian anti-aliasing functions are not well suited to 
imaging, as the alias rejection is not very good. The modern 
state of the ar t is to use prolate spheroidal wavefunctions 
(jSchwabl 11980). that have better alias rejection. Unfortu- 
nately, the Hermite functions are not a suitable basis for 
Taylor expansion of the prolate spheroidal wavefunctions. If 
suitable analytic multipole expansions of the convolution of 
the complex chirp, and the prolate spheroidal wavefunctions 
can be found, then they can be applied with the fast mul - 
tiple method (jGreengard RokhlinlflQSTl : Ig reengar dl ll988l ) 
to obtain a similar result as our FGT, but with better anti- 
aliasing properties. 

• The fast multipole method could also be used to 
compensate for the primary beam in the uv-p\a,ne (AW- 
projection, tehatnagar et al ll2008l )). Once again, this would 



require analytic multipole expansions of the required convo- 
lution functions. While this may not be possible with an 
arbitrary primary beam, it may be more tractable if we as- 
sume Gaussian primary beams. 

• For problems where the compute time of it;- project ion is 
dominated by the time to compute convolution functions (by 
Fourier transforming a phase screen in the image plane) , us- 
ing standing gridding with Gaussian anti-aliasing functions 
could be preferable, as they can be directly and efficiently 
computed in the uv domain. 

• Easily parallelised algorithms will clearly be important 
for large arrays such as SKA, where the computational work 
is likely to be spread over many thousands of nodes. Once 
the gridding problem has been distributed over the usual 
axes of frequency, pointing direction and polarisation, it is 
possible that problem may still be too large to be efficiently 
computed by a single node. For the gridding operation, the 
option of distributing and parallelising over Taylor coefii- 
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Figure 4. The image plane error of the FGT is a strong function of w and the algorithm parameterisations. Here we plot the relative 
error in the image plane of the FGT vs. w for a range of algorithm parameterisations. The solid lines are the measured error in the image 
plane and the dotted lines are the predicted values of e (Equation [17}. The top panels are for a box size L ~ 1, and the bottom panels 
are for a box size of L ~ 2 pixels. 
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Figure 5. Our implementation of gridding with the fast Gauss transform is substantially slower than standard gridding, and compute 
bound. Here we plot the ratio of compute time to grid 100 visibilities with the FGT vs. standard gridding, over a range of w. The solid 
lines are the ratio of the measured processing times, while the dotted lines are predicted ratio in operations count (the same as top 
bottom panel of Fig. [2]). The top panels are for a box size L ~ 1, and the bottom panels are for a box size of L ~ 2 pixels. 



cients adds an additional axis which an be exploited when 
distributing work among processors, over and above the tra- 
ditional axes. 



8 CONCLUSIONS 

We have described a procedure to perform project ion with 
the fast Gauss transform with variable scales, where the anti- 



aliasing function is chosen to be a Gaussian. The gridding 
problem is solved by the FGT with source variable scales, 
and the degridding problem by the FGT with target vari- 
able scales. While the theoretical efficiency of our approach 
is encouraging, we were not able to approach the the theoret- 
ical performance gains with our implementation. Nonethe- 
less, we find that w projection with approximate algorithms 
such as the FGT or fast multipole methods may yet have 
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promise, by having the attractive properties of tuneable er- 
ror, an additional parallelisation axis, and no calculation 
and storage of convolution functions. The methods require 
additional research, to improve the practical implementa- 
tion, find expansions with faster convergence, and find closed 
forms with better anti-aliasing properties. 
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